Simulation study

DIABLO selects correlated and discriminatory variables

Briefly, three omic datasets consisting of 200 samples (split equally over two groups) and 260 variables were generated by modifying the degree of correlation and discrimination, resulting in four types of variables: 30 correlated-discriminatory (corDis) variables, 30 uncorrelated-discriminatory (unCorDis) variables, 100 correlated-nondiscriminatory (corNonDis) variables, and 100 uncorrelated-nondiscriminatory (unCorNonDis) variables.

Generate different types of variables and apply diablo to each type separately

  • 3 datasets (effective sample size = 100; group1=100 observations, group2=100 observations)
  • each dataset has four types of variables; lets explore them now
  • 30 variables that contribute to correlated & discriminatory components
  • 30 variables that contribute to correlated & nondiscriminatory components
  • 100 variables that contribute to uncorrelated & discriminatory components
  • 100 variables that contribute to uncorrelated & nondiscriminatory components

Correlation structure for each set of simulated variables

corDis

corNonDis

unCorDis

unCorNonDis

Simulation: vary noise and fold-change and compare with other schemes (concatenation/ensembles)

Three integrative classification methods were applied to generate multi-omic biomarkers panels of 90 variables each (30 variables from each omic dataset): a DIABLO model with either a full design (where the correlation between all pairwise combinations of datasets, as well as between each dataset and the phenotypic outcome, were maximised) or the null design (where only the correlation between each dataset and the phenotypic outcome was maximised, see Methods), a concatenation-based sPLSDA classifier which consists of naively combining all datasets into one, and an ensemble of sPLSDA classifiers where a separate sPLSDA classifier was fitted for each omics dataset and the consensus predictions were combined using a majority vote scheme.

Integrative prediction frameworks including multi-step approaches (concatenation, ensemble) and DIABLO to identify multi-omics molecular signatures. Concatenation-based integration combines multiple datasets into a single large dataset, with the aim to predict a phenotype of interest. Ensemble-based classification methods construct a predictive model on each individual dataset before combining the model predictions. None of these approaches account or model relationships between datasets and thus limit our understanding of molecular interactions at multiple functional levels. DIABLO simultaneously maximizes the associations between datasets and a phenotype of interest to identify a correlated set of variables of different omics-types that are also discriminatory. The prediction is based on each omics-associated component derived from the model (see Supplementary Note). All methods presented here are data-driven approaches, which do not use any prior knowledge such as from curated biological databases (eg. protein-protein interactions).

The purpose of the simulation study was to compare DIABLO models with existing multi-step integrative classifiers with respect to the error rate and types of variables selected as part of the multi-omic biomarker panels. A secondary aim was to determine the effect of design matrix on the resulting multi-omic biomarker panels identified using DIABLO. The concatenation, ensemble and DIABLO_null classifiers performed similarly across the various noise and fold-change thresholds. At lower noise levels (simulated using a multivariate normal distribution with mean of zero and standard deviation of 0.2 or 0.5) the DIABLO_full classifier had a slightly higher error rate compared to the other approaches, but consistently selected mostly correlated and discriminatory (corDis) variables, unlike the other integrative classifiers. All methods behaved similarly with respect to the error rate and types of variables selected at higher noise thresholds (simulated using a multivariate normal distribution with mean of zero and standard deviation of 1 or 2). This simulation highlights how the design (connection between datasets) affects the flexibility of the DIABLO model, resulting in a trade-off between discrimination or correlation. DIABLO_null focused on selecting discriminatory variables and disregarded most of the correlation between datasets (null design), whereas DIABLO_full selected highly correlated variables across all datasets. Since the variables selected by DIABLO_full reflect the correlation structure between biological compartments, we hypothesized that they might provide a balance between prediction accuracy and biological insight.

Simulation study: performance assessment and benchmarking. Simulated datasets included different types of variables: correlated & discriminatory (corDis); uncorrelated & discriminatory (unCorDis); correlated & nondiscriminatory (corNonDis) and uncorrelated & nondiscriminatory (unCorNonDis) for different fold-changes between sample groups and different noise levels (see Supplementary Note). Integrative classifiers included DIABLO with either the full or null design, concatenation and ensemble-based sPLSDA classifiers and were all trained to select 90 variables across three multi-omics datasets. a) Classification error rates (10-fold cross-validation averaged over 50 simulations). Dashed line indicates a random performance (error rate = 50%). All methods perform similarly with the exception of DIABLO_full which has a higher error rate. b) Number of variables selected according to their type. DIABLO_full selected mainly variables that were correlated & discriminatory (corDis, red), whereas the other methods selected an equal number of correlated or uncorrelated discriminatory variables (corDis and unCorDis, red and blue).